A shared file system abstraction for heterogeneous architectures
Authors
Abstract
We advocate the use of high-level OS abstractions in heterogeneous systems, such as CPU-GPU hybrids. We suggest the idea of an inter-device shared file system (IDFS) for such architectures. The file system provides a unified storage space for seamless data sharing among processors and accelerators via a standard, well-understood interface. It hides the asymmetric nature of CPU-accelerator interactions, as well as architecture-specific inter-device communication models, thereby facilitating portability and usability. We explore the design space for realizing IDFS as an in-memory inter-device shared file system for hybrid CPU-GPU architectures.

1 The case for better abstractions

Recent years have seen increasingly heterogeneous system designs featuring multiple hardware accelerators. These have become common in a wide variety of systems of different scales and purposes, ranging from embedded SoCs, through server processors (IBM PowerMP) and desktops (GPUs), to supercomputers (GPUs, ClearSpeed, IBM Cell). Furthermore, the “wheel of reincarnation” [8] and economy-of-scale considerations are driving toward fully programmable accelerators with large memory capacity, such as today’s GPUs¹.

Despite the growing programmability of accelerators, developers still live in the “medieval” era of explicitly asymmetric, low-level programming models. Emerging development environments such as NVIDIA CUDA and OpenCL [1] focus on the programming aspects of the accelerator hardware, but largely overlook its interaction with other accelerators and CPUs. In that context they ignore the increasing self-sufficiency of accelerators and lock programmers into an asymmetric, CPU-centric model in which accelerators are treated as co-processors: second-class citizens under CPU control. We argue that this idiosyncratic asymmetric programming model has destructive consequences for the programmability and efficiency of accelerator-based systems.
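To make the envisioned abstraction concrete before turning to the problems it addresses, the following toy sketch (our illustration; the class name and API are hypothetical, not the authors' design) models the core idea: one shared namespace accessed through an ordinary read/write interface, so callers need not know which device's physical memory backs a file.

```python
# Illustrative sketch only: a toy in-memory "inter-device" file system.
# The IDFS paper proposes the abstraction; this implementation and its
# API (InMemoryIDFS, write, read) are our invention for illustration.

class InMemoryIDFS:
    """One namespace shared by all processors and accelerators; a caller
    never sees which physical memory (CPU or device) backs a file."""

    def __init__(self):
        self._files = {}  # path -> bytearray

    def write(self, path, data, offset=0):
        buf = self._files.setdefault(path, bytearray())
        if len(buf) < offset + len(data):
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, path, size=-1, offset=0):
        buf = self._files.get(path, bytearray())
        end = len(buf) if size < 0 else offset + size
        return bytes(buf[offset:end])


# A CPU-side producer and an accelerator-side consumer would share data
# through the same well-understood file interface:
fs = InMemoryIDFS()
fs.write("/shared/input", b"matrix rows")       # written by any device
assert fs.read("/shared/input") == b"matrix rows"  # read by any other
```

In the real system, reads and writes would resolve to the appropriate physical memory and inter-device transfer mechanism under the hood; the point of the sketch is that the interface itself is symmetric.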
Below we list the main constraints induced by this asymmetric approach.

¹NVIDIA GPUs support up to 64 GB of memory.

Problem: coupling with a CPU process. An accelerator needs a hosting CPU process to manage its (separate) physical memory and invoke computations; the accelerator’s state is associated with that process.

Implication 1: no standalone applications. One cannot build accelerator-only programs, making modular software development harder.

Implication 2: no portability. Both the CPU and the accelerator have to match the program’s target platform.

Implication 3: no fault tolerance. Failure of the hosting process also loses the state of the accelerator program.

Implication 4: no intra-accelerator data sharing. Multiple applications using the same accelerator are isolated and cannot access each other’s data in the accelerator’s memory. Sharing is thus implemented via redundant staging of the data to a CPU.

Problem: lack of I/O capabilities. Accelerators cannot initiate I/O operations and have no direct access to CPU memory². Thus, the data for accelerator programs must be explicitly staged to and from their physical memory.

Implication 1: no dynamic working set. The hosting process must pessimistically transfer all the data the accelerator could potentially access, which is inefficient for applications whose working sets are determined at runtime.

Implication 2: no inter-device sharing support. Programs employing multiple accelerators need the hosting process to implement data sharing between them by means of CPU memory.

Problem: no standard inter-device memory model. Accelerators typically provide a relaxed consistency model [1] for concurrent accesses by a CPU and an accelerator to the accelerator’s local memory. Such a model essentially forces memory consistency only at the accelerator invocation and termination boundaries.

Implication: no long-running accelerator programs.
Accelerator programs have to be terminated and restarted by a CPU before they can access newly staged data.

Implication: no forward compatibility. Programs using explicit synchronization between data transfers and accelerator invocations will require significant programming effort to adapt to the more flexible memory models likely to become available in the future.

²NVIDIA GPUs enable dedicated write-shared memory regions in the CPU memory, but with low bandwidth and high access latency.
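The staging pattern and boundary-only consistency criticized above can be made concrete with a toy model (our illustration; the names are hypothetical and the model deliberately abstracts away real CUDA/OpenCL APIs): the hosting process must copy the working set into device memory before invocation, and the device’s view becomes consistent only at invocation boundaries, so updates made after staging are invisible to a running kernel.

```python
# Toy model (our illustration) of the asymmetric, host-centric pattern:
# the device sees host data only via explicit staging, and consistency
# is forced only at invocation boundaries -- mirroring the relaxed
# inter-device consistency model described above.

class ToyAccelerator:
    def __init__(self):
        self.device_mem = {}  # device-side copies of staged buffers

    def stage_in(self, name, host_buf):
        # Explicit host-to-device copy; must happen before launch.
        self.device_mem[name] = bytes(host_buf)

    def launch(self, kernel):
        # Consistency is established only here: the kernel runs against
        # whatever was staged before this invocation.
        return kernel(self.device_mem)


host_data = bytearray(b"abc")
acc = ToyAccelerator()
acc.stage_in("in", host_data)

host_data[0:1] = b"X"            # host update made after staging...
seen = acc.launch(lambda mem: mem["in"])
assert seen == b"abc"            # ...is invisible: the kernel sees the
                                 # stale copy until it is re-staged and
                                 # the program is re-launched
```

This is exactly why long-running accelerator programs are impossible under such a model: the only way for the kernel to observe `b"Xbc"` is to terminate it, stage the buffer again, and launch anew.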